Fast Online Clustering with Randomized Skeleton Sets
Abstract
We present a new fast online clustering algorithm that reliably recovers arbitrarily shaped data clusters in high-throughput data streams. Unlike existing state-of-the-art online clustering methods based on k-means or k-medoids, it does not make any restrictive generative assumptions. In addition, in contrast to existing nonparametric clustering techniques such as DBSCAN or DenStream, it gives provable theoretical guarantees. To achieve fast clustering, we propose to represent each cluster by a skeleton set, which is updated continuously as new data is seen. A skeleton set consists of weighted samples from the data, where the weights encode local densities. The size of each skeleton set is adapted according to the cluster geometry. The proposed technique automatically detects the number of clusters and is robust to outliers. The algorithm works on infinite data streams, where more than one pass over the data is not feasible. We provide theoretical guarantees on the quality of the clustering and also demonstrate its advantage over the existing state of the art on several datasets.
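The idea of a skeleton set of weighted samples can be illustrated with a minimal Python sketch. This is not the algorithm from the paper, whose update rules and guarantees are not given in the abstract. The class name `SkeletonClusterer`, the single `radius` parameter, the absorb-or-create update rule, the `2 * radius` linking rule, and the `min_weight` outlier filter are all assumptions made for illustration only.

```python
import math


class SkeletonClusterer:
    """Toy single-pass clusterer built on weighted representative points.

    Hypothetical illustration only: not the method from the paper.
    """

    def __init__(self, radius):
        self.radius = radius   # assumed absorb radius (hypothetical parameter)
        self.points = []       # coordinates of the skeleton points
        self.weights = []      # how many stream points each one absorbed

    def update(self, x):
        """Process one stream point; the point itself is never stored twice."""
        best, best_d = None, self.radius
        for i, p in enumerate(self.points):
            d = math.dist(p, x)
            if d <= best_d:
                best, best_d = i, d
        if best is None:
            # No skeleton point is close enough: x becomes a new one.
            self.points.append(tuple(x))
            self.weights.append(1)
        else:
            # Absorb x: the weight acts as a crude local density count.
            self.weights[best] += 1

    def clusters(self, min_weight=1):
        """Link skeleton points at distance <= 2 * radius using union-find.

        Skeleton points with weight below min_weight are treated as outliers.
        Returns a list of clusters, each a list of skeleton point indices.
        """
        keep = [i for i, w in enumerate(self.weights) if w >= min_weight]
        parent = {i: i for i in keep}

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]  # path halving
                i = parent[i]
            return i

        for a in keep:
            for b in keep:
                if a < b and math.dist(self.points[a], self.points[b]) <= 2 * self.radius:
                    parent[find(a)] = find(b)

        groups = {}
        for i in keep:
            groups.setdefault(find(i), []).append(i)
        return list(groups.values())
```

In this sketch the number of clusters is not fixed in advance: it falls out of how the skeleton points link together, and isolated low-weight points can be dropped by raising `min_weight`. Each `update` call scans all skeleton points, so a practical version would need a spatial index. That is outside the scope of this illustration.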
Related resources
Improved COA with Chaotic Initialization and Intelligent Migration for Data Clustering
A well-known clustering algorithm is K-means. This algorithm, besides advantages such as high speed and ease of use, suffers from the problem of local optima. Many studies in clustering have tried to overcome this problem. This paper presents a hybrid of the Extended Cuckoo Optimization Algorithm (ECOA) and K-means (K), which is called ECOA-K. The COA algorithm has advantages ...
Randomized Algorithms for Fast Bayesian Hierarchical Clustering
We present two new algorithms for fast Bayesian Hierarchical Clustering on large data sets. Bayesian Hierarchical Clustering (BHC) [1] is a method for agglomerative hierarchical clustering based on evaluating marginal likelihoods of a probabilistic model. BHC has several advantages over traditional distance-based agglomerative clustering algorithms. It defines a probabilistic model of the data a...
Fast online graph clustering via Erdös-Rényi mixture
In the context of graph clustering, we consider the problem of estimating simultaneously both the partition of the graph nodes and the parameters of an underlying mixture of affiliation networks. In numerous applications the rapid increase of data size with time makes classical clustering algorithms too slow because of the high computational cost. In such situations online clustering algorithms...
Example-based skeleton extraction
We present a method for extracting a hierarchical, rigid skeleton from a set of example poses. We then use this skeleton to not only reproduce the example poses, but create new deformations in the same style as the examples. Since rigid skeletons are used by most 3D modeling software, this skeleton and the corresponding vertex weights can be inserted directly into existing production pipelines....
Identifying Facets in Query-Biased Sets of Blog Posts
We investigate the identification of facets of query-biased sets of blog posts. Given a set of blog posts relevant to a topic, we compare several methods for identifying facets of the topic in this set. Building on a clustering of a set of blog posts, we compare several cluster labeling methods, and find that a method that makes use of blog and blog search specific features outperforms other me...
Journal title:
- CoRR
Volume abs/1506.03425, issue -
Pages -
Publication date 2015